Text Manipulation

Introduction

In this chapter, we explore various text manipulation techniques in Python. More specifically, we start by discussing how to handle strings, focusing on printing and combining them. Next, we introduce the most widely used methods and functions for handling strings. Lastly, we cover how to identify and work with patterns in text data, a concept known as regular expressions.

From Numeric Data to Strings

When working with numeric data, it is quite intuitive to perform operations like addition or multiplication. However, manipulating strings (or character data), one of the core data types in Python, relies on dedicated methods and functions. String manipulation can become complex, especially when combining strings from a single list or from different columns of a data frame. With text data, we can perform tasks such as adding or replacing text, finding matches, counting letters, locating the positions of specific characters, and much more.

Printing Strings

We can use single quotes ('') or double quotes ("") to specify any value as a string. For instance, suppose we want to print one of the best-known phrases in the Computer Science, Data Science, and Data Engineering world, "Hello world!". We can print this phrase with the print() function, enclosing the text in single or double quotes:

# Printing with single quotes
print('Hello world!')
Hello world!
# Printing with double quotes
print("Hello world!")
Hello world!

In both cases, we see that we get the exact same results. However, what happens if we need to have double or single quotes within a string? Since Python would not know which quotes we want to include in the string, we need to be able to clarify which quotes are part of the text itself, and which quotes are used to indicate a string. To do so, we need to use what is called an escape sequence. For this, we use the special character backslash (\) before the single or double quotes that we want to include in the string:

# Printing a string that contains double quotes
print("I want to print \"Hello World!\"")
I want to print "Hello World!"

Another way is to use single quotes around the string itself if we want to include double quotes inside the text (or the other way around); this avoids the need for escape sequences in some cases:

# Printing with double quotes
print('I want to print "Hello World!"')
I want to print "Hello World!"
# Printing with single quotes
print("I want to print 'Hello World!'")
I want to print 'Hello World!'
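
When the text contains both kinds of quotes, switching the outer quotes is not enough on its own. The sketch below shows two options under that assumption: escaping the inner double quotes, or wrapping the text in triple quotes. The variable names are illustrative.

```python
# The text contains both an apostrophe and double quotes.
# Option 1: double quotes outside, escape the inner double quotes.
escaped = "It's called \"Hello World!\""

# Option 2: triple quotes outside, so neither kind needs escaping.
triple = '''It's called "Hello World!"'''

print(escaped)            # It's called "Hello World!"
print(escaped == triple)  # True
```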

Combining Strings

Python provides several ways to combine strings together. One common approach is to use the + operator. For example, we can combine two separate strings to create the phrase "Hello World!":

# Combining strings with +
"Hello" + " World!"
'Hello World!'

We can also store strings in variables and combine them:

# Combining variables
word1 = "Hello"
word2 = "World!"

# Printing results
print(word1 + " " + word2)
Hello World!

Notice that we included a space " " between the two words. Without this space, the result would be "HelloWorld!".

Sometimes we may also need to combine strings with values of other data types, such as integers or floats. In these cases, we need to convert the value into a string first using the function str(). The following code highlights this concept:

# Combining strings with numbers
age = 33

# Printing sentence
print("I am " + str(age) + " years old.")
I am 33 years old.

The str() function converts a value into a string so that it can be combined with other strings.
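
To see why the conversion is needed, the following sketch tries the same concatenation without str(). Python does not convert numbers to strings implicitly, so the + operator raises a TypeError. The try/except block is only there to keep the example running, and error_name is an illustrative variable.

```python
age = 33

# Without str(), + raises a TypeError because Python does not
# convert the integer to a string implicitly.
try:
    sentence = "I am " + age + " years old."
except TypeError as err:
    sentence = None
    error_name = type(err).__name__

print(error_name)  # TypeError

# With str(), the concatenation works as intended.
sentence = "I am " + str(age) + " years old."
print(sentence)    # I am 33 years old.
```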

Another common approach is to use the method join(). This method allows us to combine multiple strings using a separator:

# Combining strings with join()
" ".join(["Hello", "World!"])
'Hello World!'

The separator is placed before join(). In the example above, the separator is a space " ". We can use other separators as well. For instance, suppose we want to print the string "Data-Science":

# Combining with a different separator
"-".join(["Data", "Science"])
'Data-Science'

If we want to combine all elements of a list into a single string, we can again use the join() method in the following way:

# Combining all elements into one string
words = ["Data", "Science", "Analytics"]

# Combining with a different separator
" and ".join(words)
'Data and Science and Analytics'
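
Note that join() expects every element to be a string. If the list holds numbers, one possible approach is to convert each element with str() first, as in this small sketch (the values in scores are made up for illustration):

```python
scores = [75, 82, 85]  # illustrative values

# join() only accepts strings, so each number is converted first
joined = ", ".join(str(s) for s in scores)
print(joined)  # 75, 82, 85
```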

A modern and very readable way to combine strings in Python is with f-strings. An f-string is a string with an f placed before the opening quote, which allows us to embed variables or expressions directly inside the text using curly braces {}. Unlike a regular string, which treats everything as plain text, an f-string evaluates what is inside {} and replaces it with its value when the code runs:

# Using f-strings
field = "Data Science"

# Printing results
print(f"I am learning {field}!")
I am learning Data Science!
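
The curly braces can hold more than a variable name. As a short sketch, the example below places a method call, a function call, and a calculation inside the braces, and uses a format specifier (the part after the colon) to control the number of decimals. The variable hours is a hypothetical value added for illustration.

```python
field = "Data Science"
hours = 7.5  # hypothetical value

# Expressions are evaluated inside the braces
summary = f"{field.upper()} has {len(field)} characters."
print(summary)  # DATA SCIENCE has 12 characters.

# A format specifier after a colon controls number formatting
message = f"I studied {hours * 2:.1f} hours this week."
print(message)  # I studied 15.0 hours this week.
```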

When combining strings, it is always a good idea to print the results and verify that the output matches what we expect, especially when working with lists or loops.

Use of f-Strings in GenAI Prompts

In GenAI applications, f-strings are commonly used to dynamically build prompts by inserting variables (such as user input, model outputs, or parameters) directly into text. This makes it easy to create flexible and context-aware instructions for language models, since the prompt can change automatically depending on the data being passed in.
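
The idea can be sketched as follows. The function build_prompt, its parameters, and the wording of the prompt are all hypothetical and only illustrate how variables are inserted into the text; no language model is called.

```python
def build_prompt(question, context, max_words=50):
    """Assemble a prompt string from its parts (hypothetical helper)."""
    return (
        f"Answer the question using only the context below.\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        f"Limit the answer to {max_words} words."
    )

# The same template produces a different prompt for each input
prompt = build_prompt(
    question="What is text mining?",
    context="Text mining is an essential skill.",
)
print(prompt)
```

Changing the arguments changes the prompt without touching the template, which is what makes this approach convenient when the inputs come from users or from earlier model outputs.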

String Functions

Python provides a wide range of built-in tools for text manipulation. Unlike other environments that rely on separate packages, Python handles most string operations directly through the str type and its associated methods. This makes string manipulation both consistent and easy to use.

  • Methods are applied directly to strings (e.g., "text".method()).

  • Many operations are available without importing extra libraries.

  • Pattern-based operations use the re module when needed.

Let’s start by creating a list of strings:

# Quotes
quotes = ["Become a Master in Data Science.", 
          "The best way to learn data science is to do data science.", 
          "Text mining is an essential skill."]

With this list, we can experiment with different string methods to perform common text operations. For instance, suppose we want to check whether the string "is" exists within each element. We can do this using the in operator inside a list comprehension:

# Is the pattern in the string?
["is" in q for q in quotes]
[False, True, True]

In this code, Python goes through each element in the list quotes one by one. For each element (represented by q), it checks whether the substring "is" is present. The expression "is" in q returns True if the pattern is found and False otherwise. The list comprehension collects all these results and returns a new list of Boolean values corresponding to each sentence in quotes. As expected, we get False, True, and True because the string "is" appears in the second and third elements but not in the first one.

Another useful operation is identifying which elements contain a specific pattern. This can be done using enumerate() together with a condition:

# Returning the indexes of entries that contain the pattern
[i for i, q in enumerate(quotes) if "is" in q]
[1, 2]

In this code, enumerate(quotes) lets us loop through the list while keeping both the index (i) and the text (q). For each element, we check whether "is" appears in the string using "is" in q. If the condition is True, we keep the index. The result is a list of positions where the pattern occurs in quotes.
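
Two small variations on the same idea may be useful in practice: keeping the matching strings themselves rather than their indexes, and making the check case-insensitive by converting each string with lower() first. This is a sketch that reuses the quotes list from above; the other variable names are illustrative.

```python
quotes = ["Become a Master in Data Science.",
          "The best way to learn data science is to do data science.",
          "Text mining is an essential skill."]

# Keep the matching strings themselves instead of their indexes
matches = [q for q in quotes if "is" in q]
print(len(matches))       # 2

# The in operator is case-sensitive: "data" does not match "Data"
has_data = ["data" in q for q in quotes]
print(has_data)           # [False, True, False]

# Converting to lowercase first makes the check case-insensitive
has_data_any_case = ["data" in q.lower() for q in quotes]
print(has_data_any_case)  # [True, True, False]
```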

Regarding subsetting strings, Python allows both position-based slicing and pattern-based filtering. Slicing can be used to extract specific parts of a string based on character positions. For example, the code below takes the first 6 characters of each string in the list by using [:6], which means “start from the beginning and stop before index 6”:

# Extracting the first 6 characters
[q[:6] for q in quotes]
['Become', 'The be', 'Text m']
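
Slices are not limited to the start of a string. As a brief sketch on a single quote, negative positions count from the end, and a start and stop position together extract a piece from the middle:

```python
quote = "Text mining is an essential skill."

# Negative positions count from the end of the string
last_six = quote[-6:]
print(last_six)        # skill.

# A start and a stop position extract a piece from the middle
middle = quote[5:11]
print(middle)          # mining

# Everything except the last character
without_period = quote[:-1]
print(without_period)  # Text mining is an essential skill
```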

If we want to check whether a pattern exists in a more explicit way, Python also provides helpful string methods such as find(), which returns the position of the first occurrence of a substring. Note that if the substring is not found, the method returns -1.

# Finding position of "is" in each string
[q.find("is") for q in quotes]
[-1, 35, 12]
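
Because find() returns -1 rather than False when nothing is found, it is safer to compare its result with -1 explicitly. The sketch below also uses the optional start argument of find() to locate a later occurrence, and count() to count occurrences. The variable names are illustrative.

```python
quote = "The best way to learn data science is to do data science."

# Position of the first occurrence
first = quote.find("data")
print(first)   # 22

# The optional second argument sets where the search starts
second = quote.find("data", first + 1)
print(second)  # 44

# count() returns the number of non-overlapping occurrences
print(quote.count("data"))  # 2

# A missing substring gives -1, so compare with -1 explicitly
found = quote.find("Python") != -1
print(found)   # False
```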

Lastly, we can split strings into parts using the method split(). This divides a string based on a specified separator and returns a list of components:

# Splitting the quotes
[q.split("is") for q in quotes]
[['Become a Master in Data Science.'], ['The best way to learn data science ', ' to do data science.'], ['Text mining ', ' an essential skill.']]
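
Two details of split() are worth a quick illustration. Called without a separator, it splits on whitespace, which is a simple way to get the words of a sentence. With a separator such as "is", it splits wherever that substring occurs, including inside other words. The example strings are illustrative.

```python
quote = "Text mining is an essential skill."

# Without a separator, split() divides the string on whitespace
words = quote.split()
print(words)  # ['Text', 'mining', 'is', 'an', 'essential', 'skill.']

# With a separator, the split also happens inside words ("This")
pieces = "This is it".split("is")
print(pieces)  # ['Th', ' ', ' it']
```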

There are many other useful string methods in Python, but the examples above cover common operations for searching, slicing, splitting, and filtering text data. String manipulation is a key part of data cleaning and preprocessing, so it is worth revisiting these methods when working with real datasets. The table below gives an overview of the most commonly used methods and functions, and it is a useful reference whenever we need to solve a task that involves strings.

Method or Function    Description
------------------    -----------
in operator           Is the pattern in the string?
str.find()            Return position of first occurrence of a substring
str.replace()         Replace a substring with another string
str.upper()           Convert all characters to uppercase
str.lower()           Convert all characters to lowercase
len()                 Return number of characters in a string
str.strip()           Remove whitespace from start and end of a string
sorted()              Return strings in sorted (alphabetical) order
str.join()            Combine multiple strings using a separator
str.split()           Split a string into parts based on a separator
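
A few of the methods and functions from the table that were not demonstrated earlier can be tried on small examples. This is a minimal sketch; the sample strings are illustrative. Note that sorted() compares characters by their code points, so uppercase letters come before lowercase ones unless a key such as str.lower is supplied.

```python
raw = "  Data Science  "

# strip() removes whitespace from both ends
clean = raw.strip()
print(clean)          # Data Science
print(len(clean))     # 12

# upper() and lower() change the case of all characters
print(clean.upper())  # DATA SCIENCE
print(clean.lower())  # data science

# replace() swaps one substring for another
replaced = clean.replace("Science", "Analytics")
print(replaced)       # Data Analytics

# sorted() orders by code point: uppercase before lowercase
libraries = ["pandas", "NumPy", "matplotlib"]
print(sorted(libraries))                 # ['NumPy', 'matplotlib', 'pandas']
print(sorted(libraries, key=str.lower))  # ['matplotlib', 'NumPy', 'pandas']
```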

Regular Expressions

In Python, regular expressions are pattern-matching tools that provide a concise and flexible syntax for searching and manipulating text. Put simply, we use regular expressions to describe patterns in strings (Friedl, 2006). To see what this means in practice, we will use the string "Data!" and the function search() from the re module in Python's standard library. search() returns a match object when the pattern is found and None otherwise, so comparing the result with None gives True or False:

import re

# Checking regular expression for "Data!"
re.search("^....!", "Data!") is not None
True

What exactly is this pattern? We matched the string "Data!" using a sequence of special characters. The caret (^) anchors the match to the start of the string; it does not itself stand for any character. We then used the dot (.) four times, because each dot matches any single character (a letter, digit, space, or symbol, but by default not a newline). Since the word "Data" contains four characters, four dots cover it. Lastly, we included the exclamation mark (!), which is not a special character: it simply matches a literal "!" in the string. Together, these describe the string "Data!", which is why we got True as the output. The same regular expression also matches similar strings such as "Math!" or "Stat!", because they follow the same pattern (four characters followed by an exclamation mark):

# Checking regular expression for "Math!"
re.search("^....!", "Math!") is not None
True
# Checking regular expression for "Stat!"
re.search("^....!", "Stat!") is not None
True
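
The same pattern can be tested against a few more strings to see exactly what it accepts. In this sketch, the helper matches() and the test strings are illustrative. It shows that a dot is not restricted to letters, that the caret ties the match to the start of the string, and that search() is satisfied by a match at the beginning even when more text follows. The function fullmatch() from the re module requires the whole string to match.

```python
import re

pattern = "^....!"

def matches(text):
    """Return True if the pattern is found in text (illustrative helper)."""
    return re.search(pattern, text) is not None

# A dot matches any character, not only letters
print(matches("1234!"))      # True

# The caret ties the match to the start of the string
print(matches("Big Data!"))  # False

# search() accepts extra text after the matched part
print(matches("Data!!!"))    # True

# fullmatch() requires the entire string to fit the pattern
whole = re.fullmatch("....!", "Data!!!") is not None
print(whole)                 # False
```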

This is the difference between a regular expression and a pattern that is simply the literal value of a string. Had we passed "Data!" as the first argument of search() (the argument that describes the pattern), we would still get True in the first example, but False in the other two:

# Checking regular expression for "Data!" with the pattern "Data!"
re.search("Data!", "Data!") is not None
True
# Checking regular expression for "Math!" with the pattern "Data!"
re.search("Data!", "Math!") is not None
False
# Checking regular expression for "Stat!" with the pattern "Data!"
re.search("Data!", "Stat!") is not None
False

Regular expressions are particularly useful for cleaning and manipulating text data. For instance, suppose we have a list that describes the body weight of five people:

# Creating a list
body_weight = ["75 KG", "82 KG", "85 KG", "68 KG", "79 KG"]

# Printing body_weight 
body_weight
['75 KG', '82 KG', '85 KG', '68 KG', '79 KG']

In this case, the values are stored as strings because they include the unit "KG". However, for analysis, we are typically only interested in the numeric part. To clean the data, we use a regular expression inside the sub() function, which replaces a matched pattern with an empty string:

# Remove " KG" from each element
body_weight_clean = [re.sub(r"\sKG$", "", x) for x in body_weight]

body_weight_clean
['75', '82', '85', '68', '79']

The list comprehension loops through body_weight and applies the substitution to each string x, using the pattern r"\sKG$". The r before the opening quote marks a raw string, so Python passes backslashes (such as the one in \s) to the re module unchanged instead of treating them as the start of a Python escape sequence. The pattern matches a whitespace character (\s) followed by the letters "KG" at the end of the string ($). Each match is replaced with "" (an empty string), which removes it: "75 KG" becomes "75", "82 KG" becomes "82", and so on. The result is a new list, body_weight_clean, that holds the values without the unit. Note that these values are still strings; they must be converted with int() or float() before any numerical analysis.
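
As a possible next step, the cleaned values can be converted to integers so that arithmetic becomes available. The sketch below repeats the cleaning step to stay self-contained; the names weights and average are illustrative.

```python
import re

body_weight = ["75 KG", "82 KG", "85 KG", "68 KG", "79 KG"]

# Remove the unit, as in the example above
body_weight_clean = [re.sub(r"\sKG$", "", x) for x in body_weight]

# The cleaned values are still strings, so convert them to integers
weights = [int(x) for x in body_weight_clean]
print(weights)  # [75, 82, 85, 68, 79]

# Numeric operations are now possible
average = sum(weights) / len(weights)
print(average)  # 77.8
```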

This simple example illustrates the value of regular expressions. However, regular expressions can be confusing, especially with more complicated strings, and it is normal for them to seem hard to read and remember at first. For this reason, it is strongly recommended to practice on simple, single-string examples first, rather than applying patterns to large datasets immediately. By testing a pattern on individual cases, we can see how each symbol behaves and confirm that the pattern captures what we intend before scaling it up to more complex data. The table below lists some common symbols.

Symbol    Description
------    -----------
^         Start of a string
$         End of a string
.         Any single character
\d        Digit
*         Zero or more occurrences of the preceding element
+         One or more occurrences of the preceding element
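
The symbols \d, + and * can be tried on small strings in the same spirit. In this sketch, the example strings and variable names are illustrative. The function findall() from the re module returns every match of the pattern as a list.

```python
import re

# \d+ matches one or more digits; findall() returns every match
numbers = re.findall(r"\d+", "75 KG and 82 KG")
print(numbers)      # ['75', '82']

# + requires at least one digit before " KG"
plus_ok = re.search(r"^\d+ KG$", "75 KG") is not None
plus_empty = re.search(r"^\d+ KG$", " KG") is not None
print(plus_ok)      # True
print(plus_empty)   # False

# * also accepts zero digits
star_empty = re.search(r"^\d* KG$", " KG") is not None
print(star_empty)   # True
```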